By:
1. BING HONGJIAN - A0172543H
2. BRYON LIEW KAI RONG - A0168236Y
3. LIM WEE SHENG GAVIN - A0166830A
4. ZHOU KAI JING - A0171366A
Bike-sharing is a service where bicycles are made available for shared use to individuals on a short-term basis. These businesses generally follow three distinct business models: docking stations, geo-fencing and free-float.
Firstly, the docking station model only allows bicycles to be borrowed and returned at designated docking stations. Secondly, the geo-fencing model allows bike hires to be ended only within a designated virtual fence. Finally, the free-float model allows users to drop off their bikes at any location within a city’s boundaries.
In Singapore, the bike-sharing industry is dominated by free-float bikes offered by several private companies. The industry began with the entrance of a local bike-sharing company, oBike, in early 2017. oBike was then followed by Ofo and Mobike, both established players in the bike-sharing market with a huge presence in China. However, in recent months some companies have emerged as losers in the fierce competition, facing losses due to illegal parking and damage to bikes. In early 2019, Mobike announced its plans to exit the market, and as such none of these original companies remain in Singapore.
After surveying the different bicycle parking stations located in Singapore, our team has also come to realise a separate problem: the uneven spread of shared bicycles across the different stations in Singapore. At some stations we observe a surplus of bicycles, with more parked bicycles than the number of docks allocated to that particular station. On the other hand, some stations face a shortage of bicycles, and their docks are completely empty with no bicycles parked at all.
This report describes our team’s project of creating an application, and demonstrates how its functions address the problems these bike-sharing companies are facing. We accomplish this by analysing data on bike-sharing services across different stations in San Francisco’s Bay Area in 2014. From our analysis, a solution is proposed to help bike-sharing companies distribute bikes across their stations and resolve the problem of shortages and surpluses of bicycles.
Currently, shared bicycles work on a QR code system. The QR codes act as a sort of “key” that allows bicycles to be locked, unlocked and parked.
The codes must first be scanned through the company’s application, after which the bike will be unlocked and can be used. When returning their bicycles, customers must first bring their bikes to a designated parking station and scan a QR code located at the parking zone to end their trip, park and lock their bikes.
These parking zones contain a geo-fencing system which only enables bikers to park and lock the bikes when the bikes are within the virtual perimeter (size of the parking zones) assigned by the biking companies.
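As an illustration of how such a check might work, a geo-fence test reduces to comparing the bike’s GPS position against the station’s centre. The R sketch below is a hypothetical example; the haversine helper, the 25 m radius and the coordinates are our own illustrative assumptions, not any operator’s actual system:

```r
# Hypothetical geo-fence check: a bike may only be locked if it lies
# within `radius_m` metres of the station's centre.
haversine_m <- function(lat1, lon1, lat2, lon2) {
  r <- 6371000                       # mean Earth radius in metres
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(sqrt(a))              # great-circle distance
}

can_lock <- function(bike_lat, bike_lon, station_lat, station_lon,
                     radius_m = 25) {
  haversine_m(bike_lat, bike_lon, station_lat, station_lon) <= radius_m
}

can_lock(1.30005, 103.8, station_lat = 1.3, station_lon = 103.8)  # a few metres away: TRUE
can_lock(1.31,    103.8, station_lat = 1.3, station_lon = 103.8)  # over a kilometre away: FALSE
```

In practice an operator would also have to allow for GPS error, which is one reason the virtual perimeter is a generous radius around the parking zone rather than its exact outline.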
All stations follow a similar design in 3 sizes: small, medium and large. The different sized stations have different numbers of docks to accommodate parked bicycles: 5 at small stations, 8 at medium stations and 10 at large stations. Stations are located according to the expected traffic in the area, with large stations often close to MRT stations and bus interchanges.
This current practice of locating parking stations according to the expected traffic in the area is certainly logical, as areas with a high flow of potential customers will require more bikes. However, when examining the data we still observe small and medium stations with a shortage of bikes and large stations with a surplus. Our project aims to reduce this problem.
Unlike cities such as London and Taipei, Singapore’s bike-sharing businesses follow a free-floating system where bike users can park the bikes anywhere without penalties. This practice has resulted in indiscriminate parking by bike users. To tackle this problem, the Land Transport Authority (LTA) has announced new regulations for dockless shared-bike operators as well as users. As reported by the Straits Times in September 2018, LTA will be installing Quick Response (QR) codes at public bicycle parking places to ensure that bikes are parked correctly. Furthermore, errant users will be charged $5 by licensed operators each time they park these bicycles indiscriminately.
However, given the limited space for bicycles at parking stations, other problems might arise as well. We foresee that this new law may cause bicycle overcrowding at destinations that are popular end points but less popular start points, whereas areas that are popular start points but less popular end points might face bicycle shortages throughout the day.
Here is an image of some cases of overcrowding at bicycle stations:
Furthermore, timing is a factor that creates bike shortages or excesses. For example, in places like the CBD and during peak hours, the demand for renting a bike is likely to be high. This causes bike shortages at stations in these areas, and customers may not be able to rent a bike at the station.
Under the new regulation, bike users must park their bikes at the designated QR code areas if they do not want to be fined. However, given the limited space at each station, overcrowding problems may arise as bike users squeeze and dump their bikes anywhere within the virtual fence, causing more incidents of indiscriminate parking.
Therefore, allowing customers to know the number of bikes available at the stations, and to visualise the demand for bikes at certain stations during certain time intervals, is essential in helping to deal with such problems.
Initially, our group intended to use data from local bike-sharing companies that operate in Singapore, such as Ofo, Mobike and SGBike, as they run their businesses locally and are more likely to be relevant for the purposes of this project. However, in the course of the project, taking into consideration the availability of these datasets and the fact that most of these companies, such as Ofo, Mobike and oBike, have either exited or are planning to exit the local market, we decided to use an alternative source of bicycle-sharing data from the San Francisco Bay Area.
The final dataset we have chosen to use can be obtained from this link: Data Source
library(chron)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(gplots)
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(ggplot2)
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(ggthemes)
library(leaflet)
library(readr)
library(reshape2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
##
## smiths
setwd(getwd())
StationData <- read_csv("station.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## lat = col_double(),
## long = col_double(),
## dock_count = col_double(),
## city = col_character(),
## installation_date = col_character()
## )
TripData <- read_csv("trip.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## duration = col_double(),
## start_date = col_character(),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_date = col_character(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## bike_id = col_double(),
## subscription_type = col_character(),
## zip_code = col_double()
## )
## Warning: 11058 parsing failures.
## row col expected actual file
## 11506 zip_code no trailing characters -2585 'trip.csv'
## 15287 zip_code no trailing characters -2585 'trip.csv'
## 15409 zip_code no trailing characters -2585 'trip.csv'
## 21731 zip_code no trailing characters -2585 'trip.csv'
## 32654 zip_code no trailing characters -2585 'trip.csv'
## ..... ........ ...................... ...... ..........
## See problems(...) for more details.
WeatherData <- read_csv("weather.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## date = col_character(),
## precipitation_inches = col_character(),
## events = col_character()
## )
## See spec(...) for full column specifications.
The dataset consists of 4 different csv files: station, status, weather and trips data.
In order to facilitate data analysis, data cleaning was performed after acquisition by merging the Stations, Status and Weather data into a single dataframe. From the combined dataframe, the data was aggregated into hourly averages so that we could understand hourly trends in bicycle and dock availability at each station. In addition, we can draw insights on the usage rate from these two metrics.
if (!("Status9" %in% ls())) {
if (file.exists("Data_cleaned.csv")) {
Status9 <- read_csv("Data_cleaned.csv")
} else {
StatusData <- read_csv("status.csv")#import bike station status data
StatusData$Year <- as.numeric(substr(StatusData$time,1,4)) #extract year and form a new column
StatusData <- StatusData[StatusData$Year==2014,] #only analysing in year 2014
StatusData$Date <- paste0(substr(StatusData$time,9,10),"/",substr(StatusData$time,6,7),"/",substr(StatusData$time,1,4)) #add date
StatusData$Hour <- as.numeric(substr(StatusData$time,12,13)) #add hour column
StatusData$Min <- as.numeric(substr(StatusData$time,15,16)) #add minute column
StatusData <- StatusData[,-4] #remove original time column
StatusData$Date <- as.Date(StatusData$Date, "%d/%m/%Y")
Status2 <- full_join(StatusData,StationData[,c(-7)],by=c("station_id"="id"))
Status3 <- Status2[,-2]
Status4 <- Status2[,-3]
Status5 <- Status3 %>%
group_by(station_id, Date, Year, Hour, name, lat, long, dock_count, city) %>%
summarize(Average_Docks = mean(docks_available))
Status6 <- Status4 %>%
group_by(station_id, Date, Year, Hour, name, lat, long, dock_count, city) %>%
summarize(Average_Bikes = mean(bikes_available))
Status7 <- full_join(Status6, Status5)
Status7$Usage_Rate <- Status7$Average_Docks/(Status7$Average_Bikes + Status7$Average_Docks)
Status7$Date <- as.Date(Status7$Date, "%d/%m/%Y")
WeatherData$date <- as.Date(WeatherData$date, "%m/%d/%Y")
WeatherData$Year <- as.numeric(strftime(WeatherData$date, "%Y"))
WeatherData <- WeatherData[WeatherData$Year==2014,]
Status8 <- full_join(Status7,WeatherData,by=c("Date"="date"))
na_count <- sapply(Status8, function(y) sum(length(which(is.na(y)))))
na_count <- data.frame(na_count)
Status9 <- Status8[complete.cases(Status8[,c("mean_temperature_f","mean_humidity","cloud_cover")]),]
names(Status9)[names(Status9) == 'Year.x'] <- 'Year'
Status9 <- Status9[,-ncol(Status9)]
write.csv(Status9, file = "Data_cleaned.csv", row.names = FALSE)
}
}
## Parsed with column specification:
## cols(
## .default = col_double(),
## Date = col_date(format = ""),
## name = col_character(),
## city = col_character(),
## precipitation_inches = col_character(),
## events = col_character()
## )
## See spec(...) for full column specifications.
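The Usage_Rate column created in the cleaning step above is the fraction of a station’s docks that are empty, which we treat as a proxy for how many of its bikes are out in use. A minimal illustration of the same formula on made-up numbers:

```r
# Usage_Rate as defined above: empty docks as a fraction of all docks,
# so a higher value means more of the station's bikes are in use.
usage_rate <- function(avg_bikes, avg_docks) {
  avg_docks / (avg_bikes + avg_docks)
}

usage_rate(avg_bikes = 3,  avg_docks = 7)  # 7 of 10 docks empty -> 0.7
usage_rate(avg_bikes = 10, avg_docks = 0)  # full station, nothing rented -> 0
```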
Before further analysis on weather effects on bicycle usage, a correlation matrix was generated for the three selected variables of analysis: Mean Temperature, Mean Humidity, and Cloud Cover.
cor(Status9[,c("mean_temperature_f","mean_humidity","cloud_cover")], use="complete")
## mean_temperature_f mean_humidity cloud_cover
## mean_temperature_f 1.0000000 -0.2254128 -0.1827273
## mean_humidity -0.2254128 1.0000000 0.5756338
## cloud_cover -0.1827273 0.5756338 1.0000000
From the matrix, we observe no strong pairwise correlation among the independent variables, since none of the correlation coefficients exceed the threshold of 0.8; this suggests that multicollinearity is not a concern.
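As a complementary check, variance inflation factors (VIFs) can be computed by regressing each predictor on the others; values near 1 confirm that multicollinearity is negligible. The sketch below runs on simulated stand-in data with roughly the correlation structure reported above, not on the actual Status9 dataframe:

```r
# Manual VIF: regress each predictor on the remaining ones and take
# 1 / (1 - R^2). Values above roughly 5-10 usually signal a problem.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Stand-in data mimicking the reported pairwise correlations.
set.seed(1)
n <- 1000
cloud_cover        <- rnorm(n)
mean_humidity      <- 0.58 * cloud_cover + sqrt(1 - 0.58^2) * rnorm(n)
mean_temperature_f <- -0.2 * mean_humidity + rnorm(n)

vifs <- vif_manual(data.frame(mean_temperature_f, mean_humidity, cloud_cover))
vifs  # all close to 1, consistent with the low pairwise correlations
```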
anova.twoway <- aov(Usage_Rate ~ mean_temperature_f + mean_humidity + cloud_cover, data = Status9)
summary(anova.twoway)
## Df Sum Sq Mean Sq F value Pr(>F)
## mean_temperature_f 1 63 63.39 1872.88 < 2e-16 ***
## mean_humidity 1 1 1.25 36.94 1.22e-09 ***
## cloud_cover 1 8 7.91 233.70 < 2e-16 ***
## Residuals 2945130 99687 0.03
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Multiple Linear Regression
MLRModel <- lm(Usage_Rate ~ mean_temperature_f + mean_humidity + cloud_cover, data=Status9)
summary(MLRModel)
##
## Call:
## lm(formula = Usage_Rate ~ mean_temperature_f + mean_humidity +
## cloud_cover, data = Status9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.53555 -0.11746 0.00732 0.11838 0.49396
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.690e-01 1.335e-03 351.34 <2e-16 ***
## mean_temperature_f 6.937e-04 1.635e-05 42.42 <2e-16 ***
## mean_humidity 1.549e-04 1.141e-05 13.57 <2e-16 ***
## cloud_cover -8.823e-04 5.772e-05 -15.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.184 on 2945130 degrees of freedom
## Multiple R-squared: 0.0007273, Adjusted R-squared: 0.0007263
## F-statistic: 714.5 on 3 and 2945130 DF, p-value: < 2.2e-16
Subsequently, ANOVA and multiple linear regression analyses were performed on the cleaned data, and we found a statistically significant relationship between the Usage Rate and each of the independent variables: mean temperature, mean humidity and cloud cover. Since the p-values fall below the significance level of 0.05, we can conclude that all 3 weather variables are associated with bicycle usage, although the very low R-squared indicates the individual effects are small and the significance is partly driven by the large sample size. Using this understanding, we can help consumers predict with greater confidence the availability of bicycles and docks at each location for their bicycle-sharing service.
Our team has also included multiple geo-spatial mapping functions in the application that allows us to see many trends in the spatial distribution of Bicycles in San Francisco’s Bay area.
register_google(key='YOUR_API_KEY') #substitute your own Google Maps API key; do not commit real keys to reports
bay_area_map<-get_map(location = c(mean(StationData$long),mean(StationData$lat)),source='google',zoom=10)
## Source : https://maps.googleapis.com/maps/api/staticmap?center=37.590243,-122.218416&zoom=10&size=640x640&scale=2&maptype=terrain&language=en-EN&key=xxx
ggmap(bay_area_map)+
geom_point(data=StationData,aes(x=long,y=lat,color=city,size=dock_count),alpha=0.5)+
ggtitle('Bike Stations in Bay Area')+
theme_map()
leaflet()%>%addTiles() %>%
addMarkers(data =StationData, lng = ~long, lat = ~lat,popup = ~name,
clusterOptions = markerClusterOptions())
From the map, we observe that the stations are mainly located at 5 different cities in the Bay Area. These cities are Mountain View, Palo Alto, Redwood City, San Francisco and San Jose.
#number of stations by city
station_counts <- count(StationData, vars = city)
colnames(station_counts) <- c("City", "Number_of_Stations")
station_counts
## # A tibble: 5 x 2
## City Number_of_Stations
## <chr> <int>
## 1 Mountain View 7
## 2 Palo Alto 5
## 3 Redwood City 7
## 4 San Francisco 35
## 5 San Jose 16
#number of docks by city
dock_counts <- aggregate(StationData$dock_count, by=list(city=StationData$city), FUN=sum)
colnames(dock_counts) <- c("City", "Number_of_Docks")
dock_counts
## City Number_of_Docks
## 1 Mountain View 117
## 2 Palo Alto 75
## 3 Redwood City 115
## 4 San Francisco 665
## 5 San Jose 264
#Plotting number of docks and stations by city
dock_station_counts <- left_join(station_counts,dock_counts, by = "City")
dock_station_counts <- dock_station_counts %>% gather(key, value, -City)
ggplot(dock_station_counts, aes(x = City, y = value)) +
geom_bar(aes(fill = key), stat="identity", position = "dodge") +
ggtitle("Plot of number of Docks and Stations in a City") +
ylab("Number of Stations")
From the bar graph above, it can be seen that San Francisco and San Jose have the most number of stations and bicycle docks in the Bay Area.
Before going fully into geo-spatial analysis of the data, we first clean the data.
#firstly, filter out trips whose start/end stations are not in the station dataset
stations<-union(unique(TripData$end_station_name),unique(TripData$start_station_name))
setdiff(unique(TripData$start_station_name),StationData$name)
## [1] "Broadway at Main" "San Jose Government Center"
## [3] "Post at Kearny" "Washington at Kearny"
sta<-setdiff(unique(TripData$end_station_name),StationData$name) #identified 4 stations, remove them
tripdata<-TripData[(!TripData$start_station_name %in% sta & !TripData$end_station_name%in% sta),]
#clean the trip data, such as convert the duration from seconds to minutes
tripdata$duration<-tripdata$duration/60
#check the pattern
ggplot(data=tripdata)+
geom_boxplot(aes(y=duration))
#remove the outliers, top 0.1%
quantile(tripdata$duration,0.999)
## 99.9%
## 1200.427
tripdata<-tripdata[tripdata$duration<quantile(tripdata$duration,0.999),]
#sort the stations by latitude so that geographically nearby stations are adjacent in the matrix
stationdata<-StationData[order(StationData$lat,decreasing = T),]
With the data cleaned, we now conduct a network analysis in order to view the number of trips between each pair of stations.
trip_count<-matrix(0,ncol=length(unique(tripdata$start_station_name)),nrow=length(unique(tripdata$end_station_name)))
rownames(trip_count)<-stationdata$name
colnames(trip_count)<-stationdata$name
for(i in 1:length(tripdata$id)){
sta=tripdata[i,'start_station_name'][[1]]
end=tripdata[i,'end_station_name'][[1]]
trip_count[sta,end]=trip_count[sta,end]+1
}
We then plotted a heat map to visualise the volume of trips between stations.
b<-log(trip_count+1) #use log value to plot heatmap to view the relationship between stations
heatmap.2(b, Rowv=FALSE, symm=TRUE, margin=c(6, 6), trace="none")
## Warning in heatmap.2(b, Rowv = FALSE, symm = TRUE, margin = c(6, 6), trace
## = "none"): Discrepancy: Rowv is FALSE, while dendrogram is `both'. Omitting
## row dendogram.
## Warning in heatmap.2(b, Rowv = FALSE, symm = TRUE, margin = c(6, 6),
## trace = "none"): Discrepancy: Colv is FALSE, while dendrogram is `column'.
## Omitting column dendogram.
From the heatmap, we observe that trips are predominantly intra-city rather than inter-city. This indicates that it is more useful to study the data city by city rather than as a whole when examining the bike distribution problem.
Next, our team calculated the number of trips starting and ending from each station. This allows us to calculate the in-degree and out-degree centrality of each node (station) in this network graph.
in_centrality<-as.data.frame(table(tripdata$start_station_name))
colnames(in_centrality)<-c('name','In_Degree_Centrality')
out_centrality<-as.data.frame(table(tripdata$end_station_name))
colnames(out_centrality)<-c('name','Out_Degree_Centrality')
After that, the difference [in degree] − [out degree] is calculated. By dividing this difference by the average of the two degrees, we obtain an index of the severity of bike excess (index > 0) or bike shortage (index < 0), matching the labels assigned in the code below.
station_info<-full_join(in_centrality,out_centrality)
## Joining, by = "name"
station_info<-full_join(stationdata,station_info)
## Joining, by = "name"
## Warning: Column `name` joining character vector and factor, coercing into
## character vector
Finally, we calculate the difference between bikes in and out at each station and label each station as facing a shortage or excess accordingly.
station_info$Diff<-station_info$In_Degree_Centrality - station_info$Out_Degree_Centrality
station_info$Avg<-(station_info$In_Degree_Centrality + station_info$Out_Degree_Centrality) /2
station_info$index<- station_info$Diff / station_info$Avg
station_info$State<-ifelse(station_info$index>0,'Excess','Shortage')
if (!(file.exists("Data_station_info.csv"))) {
write.csv(station_info, file = "Data_station_info.csv", row.names = FALSE)
}
ggmap(bay_area_map)+
geom_point(data=station_info,aes(x=long,y=lat,color=State,size=abs(index)),alpha=0.5)+
scale_size_continuous(guide = FALSE)+
ggtitle('Bike Stations in Bay Area ')+
theme_map()
pal <- colorFactor(c("blue", "red"), domain = c("Shortage", "Excess"),levels=c('Shortage','Excess'))
leaflet(station_info) %>% addTiles() %>%
addCircleMarkers(lng=~long,lat=~lat,
radius = ~abs(index)*20,
color = ~pal(State), popup = ~name,
stroke = FALSE, fillOpacity = 0.5
)
if (!(file.exists("Data_cleaned_trip.csv"))) {
write.csv(tripdata, file = "Data_cleaned_trip.csv", row.names = FALSE)
}
Plotting these areas of excess and shortage on the map, we can see that stations in San Francisco mostly face a shortage, as on average fewer bikes return to a station than leave it.
Next, we will work with the cleaned “Status9” data to view bike and dock availability across stations, beginning with the San Jose Diridon Caltrain Station.
test_set<-Status9[1:5,]
leaflet() %>% addTiles() %>%
addMarkers(data = test_set, lng = ~long, lat = ~lat,
popup = ~name)
PlotData<-Status9[(Status9$Date=='2014-05-02' & Status9$Hour=='12'& Status9$name=='San Jose Diridon Caltrain Station'),c(5,6,7,8,10,11) ]
In addition to the map locations, further information such as the number of bikes and docks available at the station can be found. This information is valuable for customers, as it directs them to stations with bicycles available for use, and allows them to choose the specific station and time they want information on.
#check bike availability
popup<-paste(sep='<br/>',
PlotData$name, paste(sep=' ','Bike available:',PlotData$Average_Bikes),
paste(sep=' ','Docks available:',PlotData$Average_Docks),
paste(sep=' ','Total Docks here:',PlotData$dock_count)
)
leaflet() %>% addTiles() %>%
addMarkers(data=PlotData,lng = ~long, lat = ~lat,popup=popup
) #customer can view bike and dock availability from the map
#
San_Jose_Data<-Status9[(Status9$Date=='2014-05-02' & Status9$Hour=='12'& Status9$city=='San Jose'),c(5,6,7,8,10,11) ]
popup1<-paste(sep='<br/>',
San_Jose_Data$name, paste(sep=' ','Bike available:',San_Jose_Data$Average_Bikes),
paste(sep=' ','Docks available:',San_Jose_Data$Average_Docks),
paste(sep=' ','Total Docks here:',San_Jose_Data$dock_count)
)
leaflet() %>% addTiles() %>%
addMarkers(data=San_Jose_Data,lng = ~long, lat = ~lat,popup=~popup1
) #plot all stations status at San Jose on 2014-05-02, and at 12 pm.
In addition to geo-spatial analysis, our app will include various graphs that allow customers and the company to understand key distributions within the data.
The first graphical analysis we conducted was to understand the distribution of trip durations across customers.
#Density plot of trip duration, for durations within 60 minutes
ggplot(tripdata, aes(x = duration)) +
geom_density() +
xlim(0,60) +
ggtitle("Density plot of trip duration")
## Warning: Removed 19808 rows containing non-finite values (stat_density).
#Density plots of trip duration by subscription type
subscription_trip <- tripdata[,c(2,10)]
ggplot(subscription_trip, aes(x = duration, fill = subscription_type)) +
geom_density(alpha=0.5) +
xlim(0,60) +
ggtitle("Density plots of trip duration by subscription type")
## Warning: Removed 19808 rows containing non-finite values (stat_density).
We observed that a large proportion of trips last between 3 and 15 minutes. Distinguishing durations by subscription type, we found that guest customers have a much wider spread of trip durations, with more opting for longer rides compared to the shorter, more consistent rides of subscribers.
Our team concluded that subscribers of bike-sharing services probably use them to get to and from a destination, and thus make the same trip repeatedly. Guest customers, however, may rent the bikes for different purposes and may use them for extended periods for sightseeing or exercise.
Next, our team conducted graphical analysis by filtering the data by month, day and hour. To do this, we first created new columns for the date, month, day and hour in the trips data, taken from the starting time of each trip.
date.time<-t(as.data.frame(strsplit(tripdata$start_date,' ')))
row.names(date.time) = NULL
tripdata$date<-as.Date(date.time[,1],"%m/%d/%Y")
tripdata$time<-date.time[,2]
tripdata$Month<-format(tripdata$date,'%m')
tripdata$Month<-as.numeric(tripdata$Month)
mymonths <- c("January","February","March",
"April","May","June",
"July","August","September",
"October","November","December")
tripdata$Month <- mymonths[tripdata$Month]
tripdata$Month = factor(tripdata$Month, levels = month.name)
tripdata$Day <- weekdays(tripdata$date)
tripdata$Day<-factor(tripdata$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
tripdata$hour<-ifelse(nchar(tripdata$time)==5,substr(tripdata$time,1,2),substr(tripdata$time,1,1))
tripdata$hour<-as.numeric(tripdata$hour)
month_counts <- count(tripdata, vars = Month)
colnames(month_counts) <- c("Month","Number_of_Rides")
month_counts
## # A tibble: 12 x 2
## Month Number_of_Rides
## <fct> <int>
## 1 January 51036
## 2 February 44292
## 3 March 53779
## 4 April 55051
## 5 May 55416
## 6 June 59164
## 7 July 60782
## 8 August 62422
## 9 September 55447
## 10 October 61799
## 11 November 48545
## 12 December 38644
ggplot(month_counts, aes(x=Month,y=Number_of_Rides)) +
geom_bar(stat="identity", fill="red") + ggtitle("Plot of number of rides by Month") +
ylab("Number of Rides")
From this graph, we observed that demand for bikes generally increases from March to October, except for a dip in September. November, December, January and February are identified as months with fewer customers.
#plotting by month and start city
StationData$count<-rownames(StationData)
start.city<-c()
for(i in 1:length(tripdata$id)){
k<-which(tripdata$start_station_name[i] == unique(StationData$name))
start.city[i]<-StationData$city[StationData$count==k]
}
tripdata$start.city<-start.city
end.city<-c()
for(i in 1:length(tripdata$id)){
k<-which(tripdata$end_station_name[i] == unique(StationData$name))
end.city[i]<-StationData$city[StationData$count==k]
}
tripdata$end.city<-end.city
tmp <- tapply(tripdata$id, INDEX = list(tripdata$Month, tripdata$start.city), length)
tmp <- as.data.frame(tmp) #dataframe coercion
tmp$Month = rownames(tmp) #Creating Month column from the rownames
tmp
## Mountain View Palo Alto Redwood City San Francisco San Jose
## January 1255 485 220 46252 2824
## February 1231 339 203 40010 2509
## March 1611 499 279 48390 3000
## April 1633 501 292 49583 3042
## May 1729 583 275 49325 3504
## June 1944 652 324 52498 3746
## July 2103 820 396 53874 3589
## August 2012 713 352 55757 3588
## September 1246 609 288 49811 3493
## October 1362 662 286 55760 3729
## November 1095 505 258 43932 2755
## December 862 315 106 35414 1947
## Month
## January January
## February February
## March March
## April April
## May May
## June June
## July July
## August August
## September September
## October October
## November November
## December December
stats <- melt(tmp, id.vars = "Month", variable.name="series")
stats$Month<-factor(stats$Month, levels = month.name)
ggplot(stats, aes(x=Month,y=value)) +
geom_bar(stat="identity", aes(fill = series), position="fill") +
ylab("Proportion of City") +
scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))+
theme(axis.text.x = element_text(angle = 90, hjust = 1))
When the data is split across the different cities, we find that the percentage of rides from each city is relatively constant. Although the number of trips in each city changes month to month, the proportion of riders from each city stays roughly the same.
#Counting trips by day
day_counts <- count(tripdata, vars = Day)
colnames(day_counts) <- c("Day","Number_of_Rides")
day_counts$Day<-factor(day_counts$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
#Plotting number of riders by day
ggplot(day_counts, aes(x=Day,y=Number_of_Rides)) + geom_bar(stat="identity", fill="blue") + ggtitle("Plot of number of rides by Day") + ylab("Number of Rides")
By analysing the number of trips by day, we observe that overall there are fewer rides on weekends (Saturday and Sunday). This could indicate that a majority of bike-share users use these bicycles to commute to work or school on weekdays.
tmp2 <- tapply(tripdata$id, INDEX = list(tripdata$Day, tripdata$start.city), length)
tmp2 <- as.data.frame(tmp2) #dataframe coercion
tmp2$Day = rownames(tmp2) #Creating Day column from the rownames
tmp2
## Mountain View Palo Alto Redwood City San Francisco San Jose
## Monday 3453 1006 544 100965 6120
## Tuesday 3507 1028 566 106588 6452
## Wednesday 3311 991 633 104739 6494
## Thursday 3211 936 593 103642 6665
## Friday 2686 987 538 95051 6140
## Saturday 863 822 221 37905 3054
## Sunday 1052 913 184 31716 2801
## Day
## Monday Monday
## Tuesday Tuesday
## Wednesday Wednesday
## Thursday Thursday
## Friday Friday
## Saturday Saturday
## Sunday Sunday
stats2 <- melt(tmp2, id.vars = "Day", variable.name="series")
stats2$Day<-factor(stats2$Day, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday","Sunday"))
ggplot(stats2, aes(x=Day,y=value)) +
geom_bar(stat="identity", aes(fill = series), position="fill") +
ylab("Proportion of City") +
scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))
When we plot the variation in the proportion of rides across cities, we see that San Francisco experiences a decrease in bike trips on weekends, contributing largely to the overall decrease in riders on weekends. On the other hand, Palo Alto and San Jose see a higher proportion of their bike rides on weekends.
hour_counts <- count(tripdata, vars = hour)
colnames(hour_counts) <- c("Hour","Number_of_Rides")
hour_counts$Hour<-factor(hour_counts$Hour,levels = seq(0,24))
#Plotting number of riders by hour
ggplot(hour_counts, aes(x=Hour,y=Number_of_Rides)) + geom_bar(stat="identity", fill="green") + ggtitle("Plot of number of rides by Hour") + ylab("Number of Rides")
By analysing the number of trips by hour, we observe that the largest concentration of rides occur in the morning from 7-9am and in the evening from 4-7pm. This reinforces our team’s observation that commuters use these bikes to get to-and-from work.
tmp3 <- tapply(tripdata$id, INDEX = list(tripdata$hour, tripdata$start.city), length)
tmp3 <- as.data.frame(tmp3) #dataframe coercion
tmp3$Hour = rownames(tmp3) #Creating Hour column from the rownames
tmp3$Hour<-as.factor(tmp3$Hour)
tmp3
## Mountain View Palo Alto Redwood City San Francisco San Jose Hour
## 0 19 23 12 1729 287 0
## 1 43 31 6 812 222 1
## 2 8 12 6 554 55 2
## 3 4 11 NA 291 15 3
## 4 3 2 2 974 25 4
## 5 109 5 5 2962 303 5
## 6 428 61 17 12301 1088 6
## 7 1489 200 364 37531 3653 7
## 8 2187 617 401 77101 3151 8
## 9 2549 636 299 55109 2003 9
## 10 733 392 117 26236 1407 10
## 11 513 357 117 25192 1687 11
## 12 525 403 220 29486 2209 12
## 13 452 358 158 27332 1944 13
## 14 415 410 102 23315 1484 14
## 15 601 398 175 29006 1601 15
## 16 1730 537 307 50849 3660 16
## 17 2375 733 339 72106 4721 17
## 18 1961 630 272 49803 3205 18
## 19 1003 422 161 24700 1848 19
## 20 391 195 80 14157 1072 20
## 21 295 133 45 9472 844 21
## 22 169 85 47 6046 739 22
## 23 81 32 27 3542 503 23
stats3 <- melt(tmp3, id.vars = "Hour", variable.name="series")
stats3$Hour<-factor(stats3$Hour,levels = seq(0,24))
#plotting variation in number of rides by hour across the different cities
ggplot(stats3, aes(x=Hour,y=value)) + geom_bar(stat="identity", aes(fill = series), position="fill") + ylab("Proportion of City") + scale_fill_manual("legend", values = c("Mountain View" = "black", "Palo Alto" = "brown", "Redwood City" = "red", "San Francisco"= "orange", "San Jose" = "yellow"))
## Warning: Removed 1 rows containing missing values (position_stack).
When we plot the variation in the proportion of rides across cities, we see that the proportions remain fairly even across all five cities. Although there is some variation between 2200hrs and 0500hrs, the total number of trips at those times is very low, so larger swings in proportion are expected.
Our team aims to integrate the functions of statistical, geo-spatial and graphical analysis into a single application on bike-sharing services. This app will provide an interactive interface for our users to view the different data visualizations available and also how the different parameters (Date, Time, City, Station) affect the visualization.
To solve the various problems the bike-sharing industry is facing, our team has come up with two suggestions for improvement that these companies could adopt: Dynamic Pricing and Bicycle Redistribution.
The first strategy to implement is dynamic pricing. According to this payment policy, the price of renting a shared bike will change throughout the day. This price will be dependent on the demand (Rate of bikes getting rented) and the availability of bikes at both the starting station and ending station.
For example, getting a bike from a high demand, low availability station will cost users more as compared to getting one from a low demand, high availability one. Similarly, returning the bike to a high demand, low availability station will cost users less as compared to returning one to a low demand, high availability one.
By adopting this pricing strategy, users will be encouraged to pick up their bikes from stations with a greater number of bikes and return them at stations with fewer bikes. This will help resolve the uneven distribution of bikes as well as the shortage of bikes at selected stations.
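One possible shape for such a rule is sketched below. The demand measure, the availability ratio and the clamping band are our own illustrative assumptions, not a tested pricing policy:

```r
# Hypothetical dynamic-pricing rule: the price multiplier rises with
# demand and falls with availability at the chosen station.
dynamic_multiplier <- function(demand_rate, bikes_available, dock_count,
                               min_mult = 0.8, max_mult = 2.0) {
  availability <- bikes_available / dock_count  # fraction of docks holding a bike
  raw <- 1 + demand_rate - availability         # high demand, low supply => above 1
  pmin(pmax(raw, min_mult), max_mult)           # clamp to a sensible band
}

# High-demand, nearly empty station: starting a trip here costs a premium.
dynamic_multiplier(demand_rate = 0.9, bikes_available = 1, dock_count = 10)
# Low-demand, nearly full station: starting a trip here is discounted.
dynamic_multiplier(demand_rate = 0.1, bikes_available = 9, dock_count = 10)
```

The same multiplier, inverted, could be applied as a credit at the return station, so that returning a bike to a low-availability station becomes cheaper, as described above.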
Here is an example of dynamic pricing, taken from Uber:
The second strategy targets the problem of excess and shortages of bicycles in stations directly.
As seen from the ggmaps from our geo-spatial analysis, some areas observe an excess in the number of bikes available and likewise, other areas observe a shortage in the number of bikes available. One easy way to resolve the uneven distribution of bikes is to redistribute the supply of bikes according to the data.
Bikes at stations with an excess can be relocated to stations that observe a shortage. This strategy is relatively easy to implement in Singapore, as bikes are not required to be parked in docks at the stations and only need to be parked within the designated geo-zone.
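A redistribution plan of this kind can be sketched as a greedy matching that repeatedly moves bikes from the station with the largest surplus to the station with the largest deficit. The station names and balances below are illustrative, not taken from the dataset:

```r
# Greedy redistribution sketch: `balance` holds each station's surplus
# (positive) or deficit (negative); transfers run until balances settle.
plan_transfers <- function(balance) {
  moves <- data.frame(from = character(), to = character(), n = numeric())
  repeat {
    from <- names(which.max(balance))
    to   <- names(which.min(balance))
    n <- min(balance[[from]], -balance[[to]])  # bikes we can usefully move
    if (n <= 0) break
    balance[from] <- balance[from] - n
    balance[to]   <- balance[to] + n
    moves <- rbind(moves, data.frame(from = from, to = to, n = n))
  }
  moves
}

balance <- c(Embarcadero = 6, Townsend = 2, Market = -5, Beale = -3)
plan_transfers(balance)  # three moves that zero out every station's balance
```

In practice the trucking cost between stations would also enter the objective; this sketch only balances the counts.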
Through this project, our team has analysed many aspects of the bike-sharing industry. We have conducted statistical, geo-spatial and graphical analyses on bike-sharing data and proposed improvements to current industry practices. These analyses will be packaged into an application that will help both bike-sharing companies and their consumers alike.
By adopting these recommendations, we hope that the issues of indiscriminate parking and the shortage and excess of bicycles in stations will be reduced.